Research interest in sequential recommender systems, which aim to model dynamic sequence representations precisely, is growing. However, the loss functions most commonly used in state-of-the-art sequential recommendation models have fundamental limitations. To name a few: Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem arising from numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss depends on the number of negative samples, so it is likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss focuses only on the last timestamp of the training sequence, which leads to low utilization of sequence information and inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and it offers painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields average improvements in full-ranking NDCG@5 of 125.63%, 69.90%, and 33.24%, respectively. With CCE, the models' performance on the test data rises rapidly with wall-clock time and stays above that of the other loss functions for almost the whole training process.
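The contrast between CE at the last timestamp and a loss accumulated over the whole sequence can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy scores, and the choice to sum (rather than average) the per-position terms are assumptions made only for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of item scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def ce_last_step(step_scores, targets):
    """Cross-entropy computed only at the final position of the sequence."""
    return -log_softmax(step_scores[-1])[targets[-1]]

def cumulative_ce(step_scores, targets):
    """Cross-entropy summed over every position of the sequence (assumed form)."""
    return sum(-log_softmax(s)[t] for s, t in zip(step_scores, targets))

# Toy example: 3 positions, 4 candidate items. The scores are hypothetical
# model outputs; each row scores all items, so no negative sampling is used.
scores = [[2.0, 0.5, 0.1, -1.0],
          [0.2, 1.5, 0.3, 0.0],
          [0.0, 0.1, 0.2, 3.0]]
next_items = [0, 1, 3]

ce = ce_last_step(scores, next_items)
cce = cumulative_ce(scores, next_items)
```

In this sketch every position contributes a training signal, whereas the last-step variant uses only the final row of `scores`.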
translated by 谷歌翻译
Generic Object Tracking (GOT) is the problem of tracking target objects specified by bounding boxes in the first frame of a video. While the task has received much attention in recent decades, researchers have focused almost exclusively on the single-object setting. Multi-object GOT has wider applicability, which makes it more attractive for real-world applications. We attribute the lack of research interest in this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows researchers to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation by tracking multiple objects jointly. Furthermore, we propose TaMOs, a Transformer-based GOT tracker capable of processing multiple objects jointly through shared computation. TaMOs runs 4x faster with 10 concurrent objects than tracking each object independently, and it outperforms existing single-object trackers on our new benchmark. Finally, TaMOs achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.
Long short-term memory (LSTM) is a powerful type of deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks still makes their practical deployment very challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which has ultra-low model complexity while still achieving high accuracy. To fully reap this algorithmic benefit, we further develop a corresponding customized hardware architecture to support efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformation can be supported efficiently by the underlying hardware on the fly, without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM achieves both an order-of-magnitude reduction in model size and a significant accuracy improvement across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency, and energy efficiency on the LSTM-Youtube workload. On the LSTM-UCF workload, our proposed design also outperforms TIE, with higher throughput, higher energy efficiency, and comparable area efficiency.
Multi-modal robots extend their operations from one working medium to another, for example from land to air. The majority of multi-modal robots are platforms that operate in two different media; for all-terrain tasks, however, there has been little research in the literature to date. In this paper, we propose a triphibian robotic platform that addresses the challenges of differing propulsion systems and widely varying working media. In our design, three ducted fans unify the propulsion system and provide the robot with the driving forces needed for all-terrain operations. A morphable mechanism enables the transition between different motion modes; specifically, a cylindrical body serves as the rolling mechanism in land mode. The design principles of the different mechanisms and the transitions between the various locomotion modes are analyzed in detail. Finally, a triphibian robot prototype is fabricated and tested in various working media with mono-modal and multi-modal functionalities. Experiments have verified our platform, and the results show promising adaptability for future exploration tasks in different working scenarios.
Legal Judgment Prediction (LJP), which aims to predict a judgment from fact descriptions, serves as legal assistance that eases the heavy workload of a limited number of legal practitioners. Most existing methods apply various large-scale pre-trained language models (PLMs) fine-tuned on LJP tasks to obtain consistent improvements. However, we find that the state-of-the-art (SOTA) model makes judgment predictions based on wrong (or non-causal) information, which not only weakens the model's generalization capability but also leads to severe social problems such as discrimination. Here, we analyze the causal mechanism that misleads the LJP model into learning spurious correlations, and then propose a framework that guides the model to learn the underlying causal knowledge in legal texts. Specifically, we first perform open information extraction (OIE) to refine the text into a form with a high proportion of causal information, from which we generate a new set of data. Then, we design a model that learns the weights of the refined data and the raw data for LJP model training. Extensive experimental results show that our model is more generalizable and robust than the baselines and achieves new SOTA performance on two commonly used legal-specific datasets.
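Multimodal sentiment analysis has a wide range of applications because of the information complementarity in multimodal interactions. Previous works focus more on investigating efficient joint representations, but they rarely consider insufficient unimodal feature extraction and the data redundancy of multimodal fusion. In this paper, a Video-based Cross-modal Auxiliary Network (VCAN) is proposed, which consists of an audio feature map module and a cross-modal selection module. The first module is designed to substantially increase feature diversity in audio feature extraction, aiming to improve classification accuracy by providing more comprehensive acoustic representations. To enable the model to handle redundant visual features, the second module efficiently filters redundant visual frames when integrating audiovisual data. Moreover, a classifier group consisting of several image classification networks is introduced to predict sentiment polarities and emotion categories. Extensive experimental results on the RAVDESS, CMU-MOSI, and CMU-MOSEI benchmarks indicate that VCAN is significantly superior to state-of-the-art methods in improving the classification accuracy of multimodal sentiment analysis.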
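Emotion-cause pair extraction (ECPE) is a new task that aims to extract potential emotions and their corresponding causes from a document. Previous methods focused on modeling pair-wise relations and achieved promising results. However, the clause-to-clause relationship, which fundamentally symbolizes the underlying structure of a document, is still in its research infancy. In this paper, we define a novel clause-to-clause relationship. To learn it, we propose a general clause-level encoding model named EA-GAT, which comprises E-GAT and Activation Sort. E-GAT is designed to aggregate information from different types of clauses; Activation Sort leverages the individual emotion/cause predictions and a sort-based mapping to push the clauses toward a more favorable representation. Since EA-GAT is a clause-level encoding model, it can be broadly integrated with any previous approach. Experimental results show that our approach has a significant advantage over all current approaches on the Chinese and English benchmark corpora, by an average of $2.1\%$ and $1.03\%$.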
Aiming to exploit the rich information in user behaviour sequences, sequential recommendation has been widely adopted in real-world recommender systems. However, current methods suffer from the following issues: 1) sparsity of user-item interactions, 2) uncertainty of sequential records, 3) long-tail items. In this paper, we propose to incorporate contrastive learning into the framework of Variational AutoEncoders to address these challenges simultaneously. First, we introduce ContrastELBO, a novel training objective that extends the conventional single-view ELBO to the two-view case and theoretically builds a connection between VAE and contrastive learning from a two-view perspective. Then we propose the Contrastive Variational AutoEncoder (ContrastVAE for short), a two-branched VAE model with contrastive regularization, as an embodiment of ContrastELBO for sequential recommendation. We further introduce two simple yet effective augmentation strategies, named model augmentation and variational augmentation, to create a second view of a sequence and thus make contrastive learning possible. Experiments on four benchmark datasets demonstrate the effectiveness of ContrastVAE and the proposed augmentation methods. Code is available at https://github.com/YuWang-1024/ContrastVAE
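As a rough illustration of what a contrastive term between two views of the same sequences can look like, the sketch below computes an InfoNCE-style loss over paired embeddings. The abstract does not specify the form of the contrastive regularization, so the use of InfoNCE, the dot-product similarity, the `temperature` parameter, and all names are assumptions; this is not the ContrastVAE code from the repository above.

```python
import math

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(view_a, view_b, temperature=1.0):
    """Average InfoNCE loss (assumed form of the contrastive term).

    view_a[i] and view_b[i] are embeddings of the same sequence under two
    views; every view_b[j] with j != i acts as an in-batch negative.
    """
    total = 0.0
    for i, za in enumerate(view_a):
        logits = [dot(za, zb) / temperature for zb in view_b]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_z)
    return total / len(view_a)

# Hypothetical two-sequence batch: matching views give a lower loss
# than views whose pairing has been swapped.
first_view = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(first_view, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(first_view, [[0.0, 1.0], [1.0, 0.0]])
```

In a two-branched model, a term of this kind would typically be added to the per-view reconstruction and KL terms; the weighting between them is left open here.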
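Semantic segmentation is an important task that allows intelligent vehicles to understand their environment. Current deep learning methods require large amounts of labeled data for training. Manual annotation is expensive, whereas simulators can provide accurate annotations. However, the performance of a semantic segmentation model trained on simulator data drops significantly when it is applied to real scenes. Unsupervised domain adaptation (UDA) for semantic segmentation has recently attracted increasing research attention; it aims to reduce the domain gap and improve performance on the target domain. In this paper, we propose a novel two-stage entropy-based UDA method for semantic segmentation. In the first stage, we design a threshold-adaptive unsupervised focal loss to regularize predictions in the target domain; it has a mild gradient neutralization mechanism and mitigates the problem that hard samples are barely optimized in entropy-based methods. In the second stage, we introduce a data augmentation method named cross-domain image mixing (CIM) to bridge the semantic knowledge of the two domains. Our method achieves state-of-the-art mIoUs of 58.4% and 59.6% on SYNTHIA-to-Cityscapes and GTA5-to-Cityscapes with DeepLabV2, and is also evaluated with the lightweight BiSeNet.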
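For readers unfamiliar with entropy-based UDA, the quantity being minimized on unlabeled target images is the Shannon entropy of the per-pixel class distribution. The sketch below shows only that plain entropy term; it does not implement the threshold-adaptive focal loss or CIM described above, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over per-class logits for one pixel."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def pixel_entropy(logits):
    """Shannon entropy (in nats) of one pixel's predicted class distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def mean_entropy(pixel_logits):
    """Plain entropy objective: average entropy over all target-domain pixels."""
    return sum(pixel_entropy(l) for l in pixel_logits) / len(pixel_logits)

# Hypothetical 3-class logits: a confident pixel and a maximally uncertain one.
confident = [10.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0]
loss = mean_entropy([confident, uncertain])
```

Uncertain pixels dominate this average, while confident ones contribute almost nothing; how the gradient is distributed between easy and hard pixels is the aspect that the focal-style reweighting in the abstract is meant to adjust.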
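Event detection (ED), which aims to detect events in text and categorize them, is vital for understanding what actually happens in real life. However, mainstream event detection models require high-quality expert human annotations of triggers, which are often costly and therefore hinder the application of ED to new domains. In this paper, we therefore focus on low-resource ED without triggers and aim to tackle the following formidable challenges: multi-label classification, insufficient clues, and imbalanced event distribution. We propose a novel trigger-free ED method based on a machine reading comprehension (DRC) framework. More specifically, we treat the input text as the context and concatenate it with the tokens of all event types, which are treated as answers, while the default question is omitted. We can thus leverage the self-attention in pre-trained language models to absorb the semantic relations between the input text and the event types. Moreover, we design a simple yet effective event derangement module (EDM) to prevent major events from being over-learned, which yields a more balanced training process. Experimental results show that our proposed trigger-free ED model is highly competitive with mainstream trigger-based models, demonstrating its strong performance on low-resource event detection.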